Skip to content

feat(errors): structured what/why/fix errors + stable exit codes (error-system v2)#27

Merged
alex-mextner merged 2 commits into
mainfrom
error-system-v2
Jun 17, 2026
Merged

feat(errors): structured what/why/fix errors + stable exit codes (error-system v2)#27
alex-mextner merged 2 commits into
mainfrom
error-system-v2

Conversation

@alex-mextner

Copy link
Copy Markdown
Owner

error-system v2 — structured what/why/fix errors + stable exit codes

Born from two same-day prod failures whose errors were thin and undiagnosable:
unknown mcp item(s): review (known: none) (no hint it was a removed slot) and a
dead hook path surfacing only as a generic harness "PreToolUse error". This makes every
rig error a consistent WHAT / WHY / HOW-to-fix block with a stable per-class exit
code
(a public contract, documented in rig --help).

What's in

  • riglib/errors.pyRigError(what/why/fix/exit_code) + per-class codes
    (2 config, 3 drift, 4 unknown-item, 5 missing-target, 6 not-a-repo,
    127 missing-dep), a single top-level guard() renderer, did-you-mean
    (Levenshtein-nearest), and a removed-slot registry (mcp.items.review → names
    agent-tools feat(status): area-coverage summary — every reconciled area, not "mostly skills" #32 + the exact fix instead of "known: none").
  • riglib/layers.py — GLOBAL vs REPO classification, single source of truth so
    rig status groups drift by layer and names which config file declares each item.
  • riglib/missing_target.py — proactively scans the harness settings.json for
    hook commands pointing at a script gone from disk (the dead-rtk-hook case). Resolves
    the invoked script (skips bare/absolute interpreters like /usr/bin/env,
    /opt/homebrew/bin/python3; honors -c only as an interpreter flag), checks only that
    token so a runtime output arg never false-flags.
  • cli.py — 3-part error rendering + layer-grouped rig status; a non-git dir
    shows not a git repository — repo layer N/A (never the "should be committed" nag) and
    degrades gracefully when the agent-tools catalog can't be resolved. rig doctor honors
    the documented 127 missing-dependency exit code.
  • plan.py raises the structured UnknownItemError; config.py adds
    primary_config_path for error provenance.

Verification

  • uv run --with pytest python -m pytest tests/ → 504 passed.
  • bash tests/smoke.sh → fully green (incl. the non-git, clean-sample, removed-slot,
    and dead-hook legs). Fixed a latent smoke bug: the non-git test dir was nested under the
    smoke's own git init-ed $TMP, so it wasn't actually non-git.
  • Reviewed across models with review --staged (Opus / Codex); their findings on the new
    modules were addressed (HOME-isolated the new status tests; tightened the missing-target
    script resolution for absolute interpreters and -c).

Known follow-ups (out of scope for this PR; CTO to triage)

  • Per-key config provenance: primary_config_path names the repo file even when the
    bad key came from the global layer.
  • Exit-code class precedence: when rig status has both drift and a dead target, it
    returns 3 (drift) and the 5 (missing-target) signal is masked.
  • The new structured errors.ConfigError exists but config.ConfigError is still what's
    raised for malformed config (so YAML errors still render the old one-line form).
  • The non-git CatalogError fallback skips the missing-target scan (a dead ~/.claude
    hook is surfaced in normal status but not in the non-git + no-catalog path).

🤖 Generated with Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4de60aaa3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread riglib/config.py
Comment thread riglib/missing_target.py Outdated
Comment thread riglib/cli.py Outdated
alex-mextner added a commit that referenced this pull request Jun 17, 2026
Three deferred P2 findings from the PR #27 codex review, now fixed properly:

1. config.py — per-key LAYER PROVENANCE for unknown items. The loader now tracks
   which config FILE last declared each top-level key (key_sources, built in
   cascade order). An unknown item that came SOLELY from the global config is
   reported against ~/.config/rig/config.yaml, not the repo rig.yaml. plan.py's
   unknown-item errors resolve the source via config.source_for_key().

2. missing_target.py — SKIP `-m module` invocations. `python3 -m my_hook --out x`
   runs a module by import name, not a script FILE; the scanner now treats `-m`
   like `-c` (interpreter mode, no path to verify), so a healthy module-based hook
   with an absolute --out arg no longer false-flags as a missing target.

3. cli.py — EXIT-CODE PRECEDENCE. A dead hook/target now surfaces its exit code
   (EXIT_MISSING_TARGET=5) even alongside config<->disk drift, instead of being
   masked by the drift exit (3). Both findings are still printed; missing-target
   outranks drift (it fails at runtime). Precedence documented in `rig --help`.

Also makes tests/smoke.sh run pytest via `uv run --with pytest` when uv is
available (the documented command), so the suite runs on a machine whose bare
python3 lacks pytest.

Tests: RED-first for each finding; full suite 504 -> 513 passing; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
alex-mextner and others added 2 commits June 17, 2026 06:53
…or-system v2)

Born from two same-day prod failures whose errors were thin and undiagnosable:
`unknown mcp item(s): review (known: none)` (no hint it was a REMOVED slot) and a
dead hook path surfacing only as a generic harness "PreToolUse error". This makes
every rig error a consistent WHAT / WHY / HOW-to-fix block with a stable per-class
exit code (a PUBLIC CONTRACT documented in `rig --help`).

What's in:
- riglib/errors.py: RigError(what/why/fix/exit_code) + per-class codes (2 config,
  3 drift, 4 unknown-item, 5 missing-target, 6 not-a-repo, 127 missing-dep), a
  single top-level `guard()` renderer, did-you-mean (Levenshtein), and a
  removed-slot registry (mcp.items.review -> names agent-tools #32 + the fix).
- riglib/layers.py: GLOBAL vs REPO classification — single source of truth so
  `rig status` groups drift by layer and names WHICH config file declares each item.
- riglib/missing_target.py: proactively scans the harness settings.json for hook
  commands pointing at a script gone from disk (the dead-rtk-hook case). Resolves the
  invoked SCRIPT (skips bare/absolute interpreters like /usr/bin/env, honors `-c` only
  as an interpreter flag) and checks only that token, so a runtime output arg never
  false-flags.
- cli.py: `rig status` renders the 3-part errors + layer grouping; a non-git dir shows
  "not a git repository — repo layer N/A" (never the "should be committed" nag), and
  degrades gracefully when the agent-tools catalog can't be resolved. plan.py raises
  the structured UnknownItemError. config.py adds primary_config_path for provenance.
  `rig doctor` honors the documented 127 missing-dependency exit code.

Tests: 504 pytest passing; tests/smoke.sh fully green (incl. the non-git + dead-hook
+ removed-slot legs). New: test_errors, test_missing_target, test_plan_errors,
test_status_layers; smoke gains non-git / clean-sample / removed-slot assertions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three deferred P2 findings from the PR #27 codex review, now fixed properly:

1. config.py — per-key LAYER PROVENANCE for unknown items. The loader now tracks
   which config FILE last declared each top-level key (key_sources, built in
   cascade order). An unknown item that came SOLELY from the global config is
   reported against ~/.config/rig/config.yaml, not the repo rig.yaml. plan.py's
   unknown-item errors resolve the source via config.source_for_key().

2. missing_target.py — SKIP `-m module` invocations. `python3 -m my_hook --out x`
   runs a module by import name, not a script FILE; the scanner now treats `-m`
   like `-c` (interpreter mode, no path to verify), so a healthy module-based hook
   with an absolute --out arg no longer false-flags as a missing target.

3. cli.py — EXIT-CODE PRECEDENCE. A dead hook/target now surfaces its exit code
   (EXIT_MISSING_TARGET=5) even alongside config<->disk drift, instead of being
   masked by the drift exit (3). Both findings are still printed; missing-target
   outranks drift (it fails at runtime). Precedence documented in `rig --help`.

Also makes tests/smoke.sh run pytest via `uv run --with pytest` when uv is
available (the documented command), so the suite runs on a machine whose bare
python3 lacks pytest.

Tests: RED-first for each finding; full suite 504 -> 513 passing; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alex-mextner alex-mextner merged commit 4aff90f into main Jun 17, 2026
12 checks passed
@alex-mextner alex-mextner deleted the error-system-v2 branch June 17, 2026 04:55
alex-mextner added a commit that referenced this pull request Jun 17, 2026
…ixture

#27's clean-sample leg enumerates every default-ON category and disables
it to assert zero false drift. tg_ctl (added by this PR) is default-ON, so
its provision_tg_ctl action made the clean sample report drift (exit 3).
Mirror how #29 added 'gitignore: {enabled: false}' and disable tg_ctl too.
alex-mextner added a commit that referenced this pull request Jun 17, 2026
* feat(rig): add tg_ctl config block + pure plist planning

Add the tg_ctl config block (validate + plan) and the pure, effect-free
TgCtlPlan that renders the ai.hyperide.tg-ctl.plist LaunchAgent XML
byte-exact to the working hand-created file (sort_keys=False preserves the
insertion order so a re-apply is a true no-op). Default-on, per-machine
(GLOBAL layer), macOS-only. Mirrors the tmux block's schema style.

boot:null and label:null resolve to their defaults (not bool(None)=False /
str(None)="None").

Reviewed via multi-model `review`; findings addressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(rig): provision + reconcile tg-ctl boot LaunchAgent

Runner: _do_provision_tg_ctl writes the byte-exact plist, backs up a
differing prior, ensures the log dir, tears down the stale predecessor
(com.ultra.codex-tg-bot: bootout + timestamped backup + remove), and
(re)loads via launchctl bootout/bootstrap in the gui/<uid> domain. A
re-apply against the already-correct loaded plist is a skipped no-op.
RIG_TG_CTL_DRY_RUN writes the plist but skips every live/destructive
mutation (launchctl AND the stale teardown) so tests/smoke never touch the
real launchd domain.

Drift: _check_tg_ctl flags missing / divergent / written-but-not-loaded, a
leftover plist when boot:false, and the stale predecessor (extra). CLI:
GLOBAL status line shows installed / drifted / disabled / unsupported
(off-darwin), resolved through the shared plan builder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(rig): tg-ctl unit suite + HOME-isolated smoke leg

test_tg_ctl.py mirrors test_tmux.py: config validation, byte-exact plist
render (incl. against the live machine plist when present, read-only),
create/idempotent/conflict/dry-run states, stale-predecessor teardown, drift
(missing/modified/extra/not-loaded), status states, and the boot:null /
label:null / dry-run-no-stale-removal / off-darwin regressions.

conftest neutralizes the default-on tg_ctl provisioner + drift check and
stubs the gui-domain launchctl seams suite-wide (dedicated tests restore the
real ones with their own HOME-isolated tmp dirs); no test ever touches the
real ~/Library/LaunchAgents or runs real launchctl. smoke.sh gains a
focused, HOME-isolated, RIG_TG_CTL_DRY_RUN tg-ctl leg and prefers
`uv run --extra test pytest`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* docs(rig): document the tg_ctl config block

docs/config-schema.md: the tg_ctl section (keys, defaults, the byte-exact
no-op contract, gui-domain (re)load, stale-predecessor teardown, drift, the
RIG_TG_CTL_DRY_RUN seam, and the enabled:false vs boot:false distinction) +
the validation paragraph. AGENTS.md: refine the "never mutate a LIVE service"
rule — the stateless background daemons (models cron, tg_ctl) are the
documented (re)load exceptions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(rig): disable tg_ctl in the error-system-v2 clean-sample smoke fixture

#27's clean-sample leg enumerates every default-ON category and disables
it to assert zero false drift. tg_ctl (added by this PR) is default-ON, so
its provision_tg_ctl action made the clean sample report drift (exit 3).
Mirror how #29 added 'gitignore: {enabled: false}' and disable tg_ctl too.

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant